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I Abstract 

Q-i' We survey classical and recent developments in numerical linear algebra, focusing on two 

I issues: computational complexity, or arithmetic costs, and numerical stability, or perfor- 

mance under roundofl error. We present a brief account of the algebraic complexity theory 
I as well as the general error analysis for matrix multiplication and related problems. We em- 

• phasize the central role played by the matrix multiplication problem and discuss historical 

,— i' and modern approaches to its solution. 

^ ! 1 Computational complexity of linear problems 
o ! 

I— I. In algebraic complexity theory one is often interested in the number of arithmetic operations 

required to perform a given computation, modelled as a programme which receives an input 
^ ■ (a finite set of elements of some algebra) and performs a sequence of algebra operations (ad- 

. dition, subtraction, multiplication, division, and scalar multiplication). This is called the total 

\^ [ (arithmetic) complexity of the computation0 Moreover, it is often appropriate to count only 

O ' multiplications (and divisions), but not additions or multiplications by fixed scalars. These 

\^ ■ notions can be formalized |BCS97l Definition 4.7]. For now, let us invoke 

^ ' Notation. Let F be a field, let ¥[xi, . . . ,Xn] ^ A C F(xi,...,x„) be an F-algebra, and let 

O ' ^ = {^1, ■ ■ ■ ,^m} be a finite set of functions. The total arithmetic complexity of $ will be 

^ [ denoted -L^*($), and its multiplicative complexity by La{^)- 

^ ■ Intuitively, this is the minimal number of steps required to compute all of , . . . , ipm starting 

^ i from a generic input {xi, . . . ,a;„), with intermediate results in A (in all cases we consider, A 

■ ■ ■ • will simply be the algebra of polynomials or of rational functions in the input variables, and will 

not always be explicitly indicated). The input and all intermediate results are understood to be 
stored in memory, and the simultaneous computation of a set $ of functions means that at the 
end of the programme $ is contained in the set of resultsU 

Let U, V, and W be finite-dimensional vector spaces over F, and consider the class of bilinear 
functions ip: U x V ^ W (which includes matrix multiplication). To define the multiplicative 
complexity of such a function, choose bases {tij}i<i<m, {^i}i<i<n, and {tffc}i<fc<p, so that 

m n \ p 

(f I ^XiUi,^yjVj = '^(pkWk- 
=1 i=i / k=i 



(N 



^This model only counts the operations and completely ignores storage and communications costs, limitations 
on precision, details of how arithmetic is implemented (in particular we are not counting bit operations), etc. 

^For example, Xo := a;, X\ := Xo ■ Xq, X2 := Xi ■ X\, X3 := X2 ■ Xq is a programme, using 3 operations — 
solely multiplications in this example, which computes x'\ as well as any subset of {a;, x^, x*, x^}, in A — ¥[x]. 
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We regard the coefficients as variables, so that each is a homogeneous polynomial of degree 2 
in Xi, . . .,Xm,yi, ■■■,yn- 

Definition (Cf. |BCS97t Definition 14.2]). L{ip) = L^^^^,„^^^^y^^,„^y^]{{ipi, . . . ,ipp}) . 

Because we are considering the multiplicative complexity, this is a well-defined notion that does 
not depend on the choice of bases. 

It turns out that the multiplicative complexity of a bilinear function (p: U x V ^ W is 
controlled by a somewhat more well-behaved notion, the rank R{f)- This is a standard notion 
in multilinear algebra, which generalizes that of the rank of a linear map. 

Definition. Let t G Vi (8) • • • V^,. The rank R{t) is the smallest r such that one can write 
t = Yll=i with each ti a monomial tensor, i.e., of the form = f i (X" ■ ■ ■ f „ for some Vi G Vi. 

In case (f.UxV^Yis the bilinear map corresponding to a linear function (p: U V* , 
the rank R(^p) is the rank of ^ in the usual sense. Well-known algorithms, such as Gaussian 
elimination, as well as the fast algorithms described in this paper (see Section [L4l) . can quickly 
compute the rank of a matrix, but determining the rank of a tensor of order 3 already seems to 
be quite difficult. Computing the rank of a given tensor is a combinatorial or algebro-geometric 
problem [LanOSj . 

We now explain how the rank controls the complexity of a bilinear function. First, by a 
known result of Strassen (see |BCS97l Proposition 14.4]), if ip: V ^ W is a quadratic map 
between finite-dimensional vector spaces, that is, 

(/jf^XiWij = '^ipj{xi,. . . ,Xn)Wj 
\i=l / j=l 

for some bases {uj}i<j<n (resp. {wj}i<j<p) of V (resp. W) and homogeneous polynomials (pi £ 
¥[xi, . . . , Xn] of degree two, then we need not search through some rather large class of pro- 
grammes to find one which computes ip optimally, for in fact L{ip) = -Z^F[ai,...,a„] ({'/'i) • • • ^^p}) 
equals the smallest I > 1 such that 

I 

"fiv) =^fi{v)gi{v)wi (1) 

i=l 

for some linear functionals fi, gi & V* . (Note that such a formula immediately gives an obvious 
algorithm computing ip{v) using only I (non-scalar) multiplications.) 

Now let (p: U x V —i- W he & bilinear map between finite-dimensional vector spaces. This is 
covered by the preceding result of Strassen, since a bilinear map U x V ^ W may be regarded 
as a quadratic map via the isomorphism ¥[U x V] = ¥[U] ¥[V]. A bilinear algorithm for ip 
amounts to writing 

r 

ip{u,v) ='^fi{u)gi{v)wi (2) 
1=1 

for certain linear functionals fi £ U* , gi £ V* , and Wi G W. The minimum such r is the 
rank R{(p). Note that the rank of ip is not necessarily the same as its bilinear complexity, 
despite the superficially similar-looking formulae ([T]) and However, by decomposing a linear 
functional f : U x V ^ ¥ as f{u,v) = f{u,0) + f{0,v), one can see that 

L{ip) < R{ip) < 2L{ip). 

It is often easier to work with the rank rather than the more subtle notion of multiplicative (or 
total) complexity, and the above inequality shows we do not lose much in doing so. 
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1.1 Algebraic complexity of matrix multiplication 



The basic problem is to compute the (total or multiplicative) complexity of multiplying two 
n X n matrices. This is a difficult question whose answer is not at present known for n = 3, for 
instance. 

Matrix multiplication is a bilinear problem (see Section [T]) 

ip: M„x„(F) X M„xn(F) ^ M„xn(F) 
{X,Y)^XY= 

V'=l / l<i,j<n 

whose corresponding tensor will be denoted 



{n,n,n) := ^ Uij ® Vjk ® wui- 



For n = 2 Winograd proved |Win71j that seven multiplications are required, so L((2,2, 2)) = 
R{{2,2,2)) = 7, but for n = 3 even the rank is not known at present (it is known that 19 < 
i?((3,3,3)) < 23; see \BCSm Exercise 15.3], |Lann8j ). 

Instead of fixing n, one considers the asymptotic complexity of matrix multiplication: 



u;{¥) =mi{ r G 



L 



tot 




1 < i, j < n 



O(n^) 



(3) 



so that n X n matrices with entries in F may be multiplied using 0{n'^^^^'^'^) operations!! for 
every > 0. 

First of all, one can replace the total complexity in ([3|) by the multiplicative complexity or 
by the rank |BCS97l Proposition 15.1] and get the same exponent. Second, uj{¥) is invariant 
under extension of scalars |BCS97l Proposition 15.18], so it does not depend on the exact choice 
of field F (e.g., Q versus M or C), but rather only on its characteristic, which is usually taken to 
be zero (so to denotes ^(C)). 

The value of to is an important quantity in numerical linear algebra, as it determines the 
asymptotic complexity of not merely matrix multiplication but also matrix inversion, various 
matrix decompositions, evaluating determinants, etc. (see Sections 11.41 and [2.31) . 

An obvious bound is 2 < u; < 3, since the straightforward method of matrix multiplication 
uses O(n^) operations, on one hand, while on the other hand we need at least multiplications 
to compute independent matrix entries. The first known algorithm proving that a; < 3 was 
Strassen's algorithm, detailed in Section 12.11 which starts with an algorithm for multiplying 
2x2 matrices using seven multiplications and applies it recursively, giving cj < log2 7. This idea 
of exploiting recursion will be explored in the next section. 



1.2 Asymptotic bilinear complexity via tensor ranks 

The basic idea behind designing fast algorithms to multiply arbitrarily large matrices, thereby 
obtaining good upper bounds on w, is to exploit recursion: multiplication of large matrices can 
be reduced to several smaller matrix multiplications. One obvious way to do this is to decompose 

^Technically, division is not allowed, as the computation should be in F[Xij, Yij], although this is no restriction 
if F is an infinite field (see |BCS97I Remark 15.2]). 



3 



the matrix into blocks, as in Strassen's original algorithm. Strassen's "laser method" |BCS97l 
Section 15.8] is a sophisticated version of this, where several matrix-multiplication tensors are 
efficiently packed into a single bilinear operation (not necessarily itself a matrix multiplication). 
The rank of the tensor — in fact the border rank, which will be defined below — is used to keep 
track of the complexity of the resulting recursive algorithm, and appears in the resulting in- 
equality for uj. This idea of recursion is also behind the "group-theoretic" algorithms described 
in the next section. 

We have mentioned that the exponent of matrix multiplication may be defined in terms of 
the rank i?((n,n, n)): 

a;(F) = inf { T G M I i?((n, n, n)) = O(n^) } . 

The reason for dealing with the rank rather than directly with the complexity measure is that 
the rank is better behaved with respect to certain operations, and this will be useful for deriving 
bounds on the asymptotic complexity via recursion. In particular |BCS97[ Proposition 14.23], 

we have 

for bilinear maps 931 and (/?2, while the corresponding inequality with L in place of R is not 
known to be true. Let {e,h,l) be the tensor of Mexh x Mhxi — ^ -^ex/ matrix multiplication. 
Since {e,h,l) {e',h',l') ^ {ee',hh',W) |BCS97[ Proposition 14.26], we have R{{ee' , hh' , II')) = 
R{{e, h, /))i?((e', /i', I')). Using properties of the rank function, it is easy to derive bounds on co 
given estimates of the rank of a particular tensor. 

Example. If R{{h,h,h)) < r, then h"^ < r. 

The first generalization is to allow rectangular matrices, via symmetrization: we have 

R{{e,h,l)) =R{{h,l,e)) =R{{l,e,h)) 

(another nice property of the rank not shared by the multiplicative complexity), so if R[{e, h, /)) < 
r, then R{^{ehl, ehl, ehl)) < r^, and therefore 

(ehl)"^/'^ < r. (4) 

The next refinement is to multiply several matrices at once. But first we need to discuss 
border rank. The border rank appears as follows. The idea is that one may be able to approx- 
imate a tensor t of a certain rank by a family ti(e) = Yll=i''^i{^) ® ''Jii^) ® Wi{e) of tensors of 
possibly smaller rank, meaning 

e^~Hi{e) = t + 0{e) 

for some positive integer q. The border rank R{t) is the smallest r for which this is possible. 
This has a geometric interpretation, studied by Landsberg |Lan08j . 

The border rank is always less than or equal to the rank, and shares some of its proper- 
ties, including that of being hard to determine. Landsberg [Lan06| proved that i2((2,2,2)) = 
i?((2,2,2)) = 7, but for n = 3 the best result known is 14 < i?((3,3, 3)) < 21 (to be compared 
with the estimate 19 < i?((3,3,3)) < 23 mentioned before). 

The border rank may be strictly less than the rank. For instance, the rank of 

i = xi yi (g) (zi -I- Z2) -I- xi (g) 2/2 ® 21 + yi ^1 
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is 3, but its border rank is only 2: 



e ti{£) := e [{e - l)xi ®yi<^ zi + {xi + 6X2) ® {yi + £^2) "X) (zi + ez2)] = t + 0(e), 

as can be seen by expanding the left-hand side. 

The importance of the border rank is that, as in this example, the original tensor may be 
recovered from ti(e) by computing the coefficient of some power of e; in other words, from 
such an approximate algorithm for computing t we may recover an exact one. This expansion 
increases the number of monomials, so this does not help to compute t itself; the magic happens 
when we compute t"^^ for large N . Taking tensor powers corresponds to multiplying matrices 
recursively. 

The border rank replaces the rank in a refinement of JIJ, so that R{{e,h,l)) < r implies 
{ehl)^/^ < r. A bit of work, generalizing this to the case of several simultaneous matrix multi- 
plications, results in Schonhage's asymptotic sum inequality 



Prom these sorts of considerations, one can see that good bounds on the asymptotic com- 
plexity of matrix multiplication can be obtained by constructing specific tensors of small border 
rank which contain matrix tensors as components; this is the idea behind Strassen et al.^s laser 
method. 



The principle of the laser method |BCS97l Proposition 15.41] is to look for a tensor t, of 

small border rank, which has a direct-sum decomposition into blocks each of which is isomorphic 
to a matrix tensor, and whose support is "tight", ensuring that in a large power of t one can find 
a sufficiently large direct sum of matrix tensors. Then one can apply ([5|). 

This combinatorial method was used by Coppersmith and Winograd |CW90j to derive uj < 
2.376, the best estimate currently known. 

1.3 Group-theoretic methods of fast matrix multiplication 

As explained in the previous section, the general principle is to embed several simultaneous 
matrix multiplications in a single tensor, via some combinatorial construction to ensure that the 
embedding is efficient. 

A rough sketch of Cohn et a/.'s |CKSU05j "group-theoretic" algorithms is that they involve 
embedding matrix multiplication into multiplication in a group algebra C[G] of a finite group G. 
The embedding uses three subsets of G satisfying the "triple product property" to encode matrices 
as elements of the group algebra, so that the matrix product can be read off the corresponding 
product in C[G]. The number of operations required to multiply two matrices is, therefore, less 
than or equal to the number of operations required to multiply two elements of C[G]. As a ring, 
C[G] = Mrf^xdi(C) X ••• X M(i^xdr(*C)) where di,...,dr are the dimensions of the irreducible 
representations of G (see, for instance, [LamOll Chapter 3]). This isomorphism may be realized 
as a Pourier transform on G, which can be computed efficiently. In other words, multiplication 
in C[G] is equivalent to several smaller matrix multiplications, and one can apply the algorithm 
recursively in order to get a bound on lo. 

Cohn et a/.'s embedding is of a very particular type, based on the following triple product 
property: if there are subsets X, Y, Z C G such that xx' ^yy'^^zz' ^ = 1, then x = x\ y = y\ 
and z — 2;'. This realizes the \X\ x \Y\ by \Y\ x \Z\ matrix multiplication AB by sending Cixy 
to Yl/O-xyX'^y and hyiz to Yl^y'zy'^^z; the triple product property ensures that one can extract 
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the matrix product from the product in the group algebra by looking at the coefficients of x z 
for x G X and z G Z. 

It may be more convenient, as in the previous section, to encode several matrix multi- 
plications via the simultaneous triple product property: for Xj, 1^, Zj C H one should have 
Xix'j~^yjy'i,~^ Zf^z'-"^ = 1 — > i = j = k and Xi = x[, yi = y[, Zi = z[. It follows from ([5]) that 

i k 

We remark that the simultaneous triple product property in H reduces to the triple product 
property in the wreath product G = x Sym„, so the groups actually output by this method 
turn out rather large. 

Prom this initial description it is not at all clear what kinds of groups will give good bounds. 
To this end, Cohn et al. introduce several combinatorial constructions, analogous to those of 
Coppersmith and Winograd, which produce subsets satisfying the simultaneous triple product 
property inside powers of a finite Abelian group -ff , and hence the triple product property 
inside wreath products of H with the symmetric group. This reproduces the known bounds uj < 
2.376, etc. 

The group-theoretic method therefore provides another perspective on efficiently packing 
several independent matrix multiplications into one. In both cases the essential problem seems 
to be a combinatorial one, and one can state combinatorial conjectures which would imply oj = 2. 



1.4 Asymptotic complexity of other linear problems 

One can also use recursive "divide- and- conquer" algorithms to prove that the asymptotic com- 
plexity of other problems in linear algebra is the same as that of matrix multiplication. This 
justifies the emphasis placed on matrix multiplication in numerical linear algebra. 
As a simple example, we will begin with 

Example (matrix inversion). On one hand, we have the identity 





A 






I 













which shows that two n x n matrices may be multiplied by inverting a 3n x 3n matrix. This 
shows that if an invertible nxn matrix can be inverted in 0{n'^~^^) operations, then the product 
of two arbitrary nxn matrices can also be computed in 0{n^~^'^) operations. 
In the other direction, consider the identity 



A B\ _ fA-^ +A-^BS-^CA-^ -A-^BS- 
C D ~ [ -S-^CA-^ S'^ 



S := D-CA-^B 



This shows that inversion of ^) G M2nx2n(C) can be reduced to a certain (fixed) number of 
nxn matrix multiplications and inversionsG Unfortunately, the indicated inverses, e.g., A~^, 
may not exist. This defect may be remedied by writing (^-d) ~ ~ X*{XX*)~^ . Now 
XX* is a positive-definite Hermitian matrix, to which the indicated algorithm may be applied 
(both its upper-left block and its Schur complement will be positive-definite and Hermitian). 
We conclude that fast multiplication implies fast inversion of positive-definite Hermitian, and 
therefore of arbitrary (invertible), matrices. 



For instance, 2 inversions and 6 multiplications. 
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Example {LU decomposition). Suppose, for instance, that one wishes to decompose a matrix A 
as A = LUP, where L is lower triangular and unipotent, U is upper triangular, and P is a 
permutation matrix. Note that not every matrix has such a decomposition; a sufficient condition 
for it to exist is that A have full row rank. 

One can give a recursive algorithm [BCS971 Theorem 16.4], due to Bunch and Hopcroft, 
for computing the decomposition in case A has full row rank, via a 2 x 2 block decomposition 
of A. This involves one inversion of a triangular matrix, two applications of the algorithm to 
smaller matrices, and several matrix multiplications; we elide the details. Since multiplication 
and inversion can be done fast, analysis of this algorithm shows that if an n x n matrix can be 
multiplied in 0{n^^^) operations, then the LU decomposition of an m x n matrix can be done 
in 0{nm^~^^~^) operations, that is 0{rt^^'^) in the case of a square matrix. 

To show, conversely, that fast LU decomposition implies fast matrix multiplication, one notes 
that det A may be computed from an LU decomposition of A, and that computing determinants 
is at least as hard as matrix multiplication (cf. [BCS971 Theorem 16.7]). This shows that the 
exponents of matrix multiplication, LU decomposition, and determinants coincideH 

Further examples involving other linear problems may be found in the literature; see |BCS97j 
and also Section [2?3l 



2 Numerical stability of linear problems 

Numerical stability is just as important for the implementation of any algorithm as computa- 
tional cost, since accumulation and propagation of roundoff errors may significantly distort the 
output of the algorithm, making the algorithm essentially useless. On the other hand, if roundoff 
error bounds can be established for a given algorithm, this guarantees that its output values 
can be trusted to lie within the regions provided by the error bounds. Moreover, such regions 
can typically be made small by increasing the hardware precision appropriately. Fast matrix 
multiplication algorithms, from Strassen's algorithm to the recent group-theoretic algorithms of 
Cohn et al., can be analysed in a uniform fashion from the stability point of view |DDHK07] . 

The roundoff-error analysis of Strassen's method was first performed by Brent ( [Bre70[ 
Hig90| , see also |Hig02 chap. 23]). The analysis of subsequent Strassen-like algorithms is 



due a number of authors, most notably by Bini and Lotti [BL80J. This latter approach was 
advanced in |DDHKQ7j to build an inclusive framework that accommodates all Strassen-like 
algorithms based on stationary partitioning, bilinear algorithms with non-stationary partition- 
ing, and finally the group-theoretic algorithms of the kind developed in [CU03j and |CKSU05j . 
Moreover, combining this framework with a result of Raz [ RazOS] . one can prove that there 
exist numerically stable matrix multiplication algorithms which perform 0{n^^^) operations, 
for arbitrarily small ?7 > 0, where u) is the exponent of matrix multiplication. 

The starting point of the error analysis [DDHK07| is the so-called classical model of rounded 
arithmetic, where each arithmetic operation introduces a small multiplicative error, i.e., the 
computed value of each arithmetic operation op(a,6) is given by op(a,6)(l -|- 6) where \6\ is 
bounded by some fixed machine precision e but is otherwise arbitrary. The arithmetic operations 
in classical arithmetic are {-|-, — , •}. The roundoff errors are assumed to be introduced by every 
execution of any arithmetic operation. It is further assumed that all algorithms output the exact 
value in the absence of roundoff errors (i.e., when all errors are zero). 



'Compare this result on determinants with the problem of computing the permanent, which is NP-hard! 
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The error analysis can be performed with respect to various norms on the matrices A, B, 
C = AB, as will be made clear in the next section. It leads to error bounds of the form 



llCcomp - C\\ < fi{n)e\\A\\ \\B\\ + 0{e^), (6) 

with fj,{n) typically low-degree polynomials in the order n of the matrices involved, so that 
= 0{n^) for some constant c. Switching from one norm to another is always possible, using 
the equivalence of norms on a finite- dimensional space, but this may incur additional factors 
that depend on n. 

2.1 Recursive matrix multiplication: Strassen and beyond 

In his breakthrough paper [Str69| . Strassen observed that the multiplication of two 2x2 block 
matrices requires only 7 (instead of 8) block multiplications, and used that remarkable ob- 
servation recursively to obtain a matrix-multiplication algorithm with running time 0(n'°S2^). 
Precisely, the product 

/ An Bii Bi2\^f Cii Ci2 \ 

V ^21 A22 J V -^21 B22 J \ C21 C22 J ' 

can be computed by calculating the submatrices 

Ml = {All + A22)iBn + B22) 
M2 = (^21+^22)^11 

M3 = Aii{Bi2-B22) 

M4 = ^22(^21-^11) 
M5 = (^11+^12)^22 
Me = {A21 - Aii){Bii + B12) 

M7 = {A12 - A22){B2l + B22) 

and then combining them linearly as 

Cii = Ml M4 - M5 M7 

C12 = M3 + M5 

C21 = M2 + M4 

C22 = M1-M2 + M3 + M6. 

Starting with matrices of dyadic order, this algorithm can be applied by recursively parti- 
tioning each matrix into four square blocks and running these computations. This yields running 
time 0(n^°§2 7-) ~ ^^^^^2.81^ Since any matrix can be padded with zeros to achieve the nearest 
dyadic order, the dyadic size assumption is not restrictive at all. 

The breakthrough of Strassen generated a flurry of activity in the area, leading to a num- 
ber of subsequent improvements, among those by Pan |Pa,n78| . Bini et al. [BCRL79], Schon- 
hage |Sch81j . Strassen jStr87| . and eventually Coppersmith and Winograd [CW90| . Each of 
these algorithms is Strassen-like, i.e., uses recursive partitioning and a special "trick" to reduce 
the number of block matrix multiplications. 

Such recursive algorithms for matrix multiplication can be analysed as follows. Recall that 
bilinear functions can be evaluated via bilinear algorithms, as in Equation ([2]). Since they do not 
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use commutativity of the coordinates, these algorithms apply equally well when the input entries 
are elements of a non-commutative algebra; their recursive use for matrix multiplication is then 
straightforward. A bilinear non- commutative algorithm (see |BL80j or |BD78| ) that computes 
products C = AB in Mjtx/c(]F) using t non-scalar multiplications over a subfield EI C F (not 
necessarily equal to F)|l is determined by three fc^xt matrices C/, V and W with elements in H 
such that 



Chl 



^WrsPs, where = j ^Ui^Xi j j ^vj^yj j , ]^ i (7) 



, k, 



where Xi (resp. yj) are the elements of A= (uij) (resp. of -B = (bij)) ordered column-wise, and 
C = (cij) is the product C = AB. 

For an arbitrary n, the algorithm consists of recursive partitioning and applying ((T]) to 
compute products of resulting block matrices. More precisely, suppose that A and B are of 
size nxn, where n is a power of k (which can always be achieved by padding the matrices 
A and B with zero columns and rows, as we already mentioned). Partition A and B into 
k"^ square blocks Aij, Bij of size {n/k)x{n/k). Then the blocks C^i of the product C = AB 
can be computed by applying ((T]) to the blocks of A and -B, where each block Aij, Bij has 
to be again partitioned into k'^ square sub-blocks to compute the t products Pg and then the 
blocks Chl- The algorithm obtained by running this recursive procedure log^ n times computes 
the product C = AB using at most 0{n^°^'^^) multiplications. 

Theorem f [DDHKn7[ Theorem 3.1]). A bilinear non-commutative algorithm for matrix multi- 
plication based on stationary partitioning is stable. It satisfies the error bound ([6]) where || • || is 
the maximum-entry norm and where 



fi{n) = (l -|- m.ax{as + Ps + 7r + 3) log;, n) ■ (emax • \\U\\ \\V\\ \\W\\) 



Here as = \l0g2 a^] , Ps = \^og2 bg] and = \^og2 c^] where and bs (resp. c^) are the number 
of non-zero entries of U and V (resp. W) in column s (resp. row r), while emax is an integer 
that depends (in a rather involved way) on the sparsity pattern of the matrices U, V and W. 

This theorem can be subsequently combined with the result of Raz [Raz03j that the exponent 
of matrix multiplication is achieved by bilinear non-commutative algorithms [RazOSij to produce 
an important corollary: 

Corollary ( [DDHK071 Theorem 3.3]). For every r/ > there exists an algorithm for multiplying 
n-by-n matrices which performs 0{n'^~^'^) operations (where uj is the exponent of matrix multi- 
plication) and which is numerically stable, in the sense that it satisfies the error bound jB]) with 
/i(n) = 0{n'^) for some constant c depending on r/ but not n. 

The analysis of stationary algorithms extends to bilinear matrix multiplication algorithms based 
on non-stationary partitioning. This means that the matrices ^i"|comp and -Si'^lomp are parti- 
tioned into kxk square blocks, but k depends on the level of recursion, i.e., k = k{j), and the 
corresponding matrices U, V and W also depend on j: U = U{j), V = V{j), W = W{j). 
Otherwise the algorithm proceeds exactly like the stationary algorithms. 

Finally, algorithms that combine recursive non-stationary partitioning with pre- and post- 
processing given by linear maps Pre„() and Post„() acting on matrices of an arbitrary order n 



^Field extensions have no effect on the asymptotic complexity, but changing EI will affect the constants in 
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can be analysed using essentially the same approach [DDHK07| . Suppose that the matrices 
A and B are each (linearly) pre-processed, then partitioned into blocks, respective pairs of 
blocks are multiplied recursively and assembled into a large matrix, which is then (linearly) 
post-processed to obtain the resulting matrix C. 

The analysis in [DDHK07j is performed for an arbitrary consistent (i.e., submultiplicative) 
norm || • || that in addition must be defined for matrices of all sizes and must satisfy the condition 

max||Ms|| < ||M|| < ^ IIM^II (8) 

s 

whenever the matrix M is partitioned into blocks {Ms)s (an example of such a norm is provided 
by the 2-norm || • II2). Note that the previously mentioned maximum-entry norm satisfies ^ 
but is not consistent, i.e., does not satisfy 

\\AB\\ < \\A\\ ■ \\B\\ for all A, B. 

Denoting the norms of pre- and post- processing maps subordinate to the norm || • || by || • ||op, 
we suppose that the pre- and post-processing is performed with errors 

||PRE„(M)comp - PREn(M)||op < /p,e(n)e||M|| + 0(e2), 
||PosT„(M)comp-PosT„(M)||op < fpost{n)e\\M\\ + 0{e'^) , 

where n is the order of the matrix M. As before, we denote by /u(n) the coefficient of e in the 
final error bound 

Under all these assumptions, the following error estimate follows: 

Theorem ( [DDHK071 Theorem 3.5]). A recursive matrix multiplication algorithm based on non- 
stationary partitioning with pre- and post-processing is stable. It satisfies the error bound ([6]), 
with the function // satisfying the recursion 

Kn-j) = At(nj+i)ij||POST„J|op ||PRE„J|op + 2/pre(nj)tj||P0ST„J|op + /post(?^i)||PRE„J|op 

for j = 1, . . . ,p. 

2.2 Group-theoretic matrix multipHcation 

In this section we describe the group-theoretic constructions of Cohn et al. Our exposition 
closely follows the pertinent parts of |DDHK07] . To give a general idea about group-theoretic 
fast matrix multiplication, we must first recall some basic definitions from algebra. 

Definition (semidirect product). If H is any group and Q is a group which acts (on the left) 
by automorphisms of H ^ with q ■ h denoting the action of q G Q on h & H, then the semidirect 
product H yi Q is the set of ordered pairs {h, q) with the multiplication law 

{hi,qi){h2,q2) = {hi{qi ■ h2),qiq2)- (9) 

We will identify H x {1q} with H and {Ih} x Q with Q, so that an element {h,q) G H xQ may 
also be denoted simply by hq. Note that the multiplication law of x Q implies the relation 
qh={q- h)q. 
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Definition (wreath product). If H is any group, S is any finite set, and Q is a group with a 
left action on S, the wreath product H IQ is the semidirect product (H^) x Q where Q acts on 
the direct product of |5| copies of H by permuting the coordinates according to the action of Q 
on S. (To be more precise about the action of Q on , if an element h £ is represented as 
a function h: S H, then q ■ h represents the function s i— h{q~^{s)).) 

Definition (triple product property, simultaneous triple product property). If -ff is a group and 
X, Y, Z are three subsets, we say X, Y, Z satisfy the triple product property if it is the case 
that for all Qx G QiX), Qy € QiY), G QiZ), if qxqyqz = 1 then qx = Qy = Qz = 1- Here 
Q(X) = Q{X,X) is the set of quotients; Q{S,T) := { st''^ \ s e S, t e T } <Z H . 

If { {Xi,Yi,Zi) I i G / } is a collection of ordered triples of subsets of H, we say that this 
collection satisfies the simultaneous triple product property (STPP) if it is the case that for 
ah i, j, k £ I and all qx G Q{Xi,Xj), qy G Q{Yj,Yk), qz G Q{Zk,Zi), if qxqyqz = 1 then 

= Q'y = = 1 and i = j = k. 

Definition (Abelian STP family). An Abelian S TP family with growth parameters {a, (3) is a 
collection of ordered triples {Hn, Tjv, /cat), defined for all > 0, satisfying 

1. Hf^ is an Abelian group. 

2. Tjy = { (Aj, Yi, Zi) I z = 1, 2, . . . , A^ } is a collection of A^ ordered triples of subsets of Hj\f 
satisfying the simultaneous triple product property. 

3. \Hn\ = A^"+"(i). 

4. = nf=i i^.i = nf=i m = nf=i = a^^^+°(^). 

Recall from Section 11.31 that in [CKSUOSj matrix-multiplication algorithms are constructed 
based on families of wreath products of Abelian groups. 

To get into more details, we must recall basic facts about the discrete Fourier transform 
of an Abelian group. For an Abelian group H, let H denote the set of all homomorphisms 
from H to S^, the multiplicative group of complex numbers with unit modulus. Elements of H 
are called characters and are usually denoted by the letter x- The sets H, H have the same 
cardinality. When Hi, H2 are two Abelian groups, there is a canonical bijection between the 
sets Hi X H2 and {Hi x -^2)^; this bijection maps an ordered pair {XI1X2) to the character x 
given by the formula xihi, /12) = Xi(^i)X2(/i2)- Just as the symmetric group Sym^ acts on H^ 
via the formula a ■ {hi, h2, ■ ■ ■ , hn) = (/lo--i(i)) ^0—1(2)) • • • > ^CT-i(n))) there is a left action of Sym^ 
on the set defined by the formula a ■ {xi,X2, ...,Xn) = {Xa-^i),Xa-^2), ■ ■ ■,Xa-^n))- 

Notation. The notation E{H^) will be used to denote a subset of iJ" containing exactly one 
representative of each orbit of the Sym„ action on H^. An orbit of this action is uniquely 
determined by a multiset consisting of n characters of H, so the cardinality of S(i/") is equal 
to the number of such multisets, i.e. ('^'^~"'^)- 

Given an Abelian STP family, the corresponding recursive matrix multiplication algorithm 
is defined as follows. Given a pair of n-by-n matrices A,B, find the minimum A^ such that 
k]\f ■ N\ > n, and denote the group Hn by H. If A^! > n, multiply the matrices using an 
arbitrary algorithm. (This is the base of the recursion.) Otherwise reduce the problem of 
computing the matrix product AB to ('^'^~^) instances of A^! x A^! matrix multiplication, 
using a reduction based on the discrete Fourier transform of the Abelian group H^ . 
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Padding the matrices with additional rows and columns of O's if necessary, one may assume 
that kj^ ■ N\ = n. Define subsets X,Y, Z <^ H I Sym^ as 



N 



N 



N 



X=mxJxSym^, 



Sym 



TV) 



Z= \ l{Zi \ xSym 



N ■ 



\i=l 



\i=l 



\i=l 



These subsets satisfy the triple product property |CKSU05| . Note that |X| = |y| = \Z\ = n. 
Now treat the rows and columns of A as being indexed by the sets X, Y, respectively; treat the 
rows and columns of B as being indexed by the sets Y, Z, respectively. 

The algorithm uses two auxiliary vector spaces C[H I Syuij^] and C[H^ x Sym^Yli ^ach of 
dimensionality IH]'^ N\ and each with a specific basis: the basis for C[H I Sym^y] is denoted by 
{sg \ g € H I Symjv } , and the basis for C [H^ x Sym^] is denoted by { e^^o- | x ^ , a G Symjy } • 

The Abelian STP algorithm from [CKSU05] performs the following steps, which will be 
labelled according to whether they perform arithmetic or not. (For example, a permutation of 
the components of a vector does not involve any arithmetic.) 



1. Embedding (no arithmetic): Compute the following pair of vectors in C[HlSym 



2. Fourier transform (arithmetic): Compute the following pair of vectors in C[i/^ x 
Sym^]. 



3. Assemble matrices (no arithmetic): For every x G '^{H^), compute the following 
pair of matrices A^, B^, whose rows and columns are indexed by elements of Sym^. 



AX 
^pcr 



4. Multiply matrices (arithmetic): For every x G '^{H ), compute the matrix prod- 
uct := A^B^ by recursively applying the Abelian STP algorithm. 

5. Disassemble matrices (no arithmetic): Compute a vector c:=Yl^ ^ c^^a^x,cr G 
C[H'^ X Sym^] whose components c^^a are defined as follows. Given x, cr, let xo ^ H(i/^) 
and r G Sym^ be such that X = t ' Xo- Let 

p. — r<xo 
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6. Inverse Fourier transform (arithmetic): Compute the following vector c G C[H I 
Sym^]. 



c : = 



, \H\^ 



7. Output (no arithmetic): Output the matrix C = {Cxz) whose entries are given by 
the formula 

Cxz '■— C^-lz- 

The main result of |DDHK07] establishes the numerical stability of all Abelian STP algorithms. 

Theorem f |DDHKn7[ Theorem 4.13]). If { (i?Ar, Tat, /cat) } is an Abelian STP family with 
growth parameters (a,/?), then the corresponding Abelian STP algorithm is stable. It satisfies 
the error bound ([6]), with the Frobenius norm and the function ^ of order 

;,(n) = n^ + °«. 

Remark ( |DDHK07l Remark 4.15]). The running time of an Abelian STP algorithm can also be 
bounded in terms of the growth parameters of the Abelian STP family. Specifically, the running 
time is [CKSUQSj O (n("~-^)/^ + °(^)) . Note the curious interplay between the two exponents, 
(a — l)//3 and (a + 2)/2/3: their sum is always bigger than 3, since a > 2/3 + 1 is one of the 
requirements for an Abelian STP construction: 

a — 1 a + 2 3a 6/3 + 3 
\ = — > — > 3. 

/3 2/3 2/3-2/3 

2.3 Matrix decompositions and other linear problems 

The results about matrix multiplication from the previous section can be extended to show that 
essentially a// linear algebra operations can also be done stably, in time 0{n^) or 0{n^~^^), for 
arbitrary r] > |DDH07) . For simplicity, whenever an exponent contains "+?/", it will henceforth 
mean "for any rj > 0". Below we summarize the main results of |DDH07j . 

The first result in |DDHn7| can be roughly summarized by saying that n-by-n matrices can 
be multiplied in 0{n^~^^) operations if and only if n-by-n matrices can be inverted stably in 
0(^n^+^) operations. Some extra precision is necessary to make this claim; the cost of extra 
precision is included in the 0{n^) factor. 

Other results in [DDH07j may be summarized by saying that if n-by-n matrices can be 
multiplied in 0(n'^"'"'') arithmetic operations, then the QR decomposition can be computed 
stably (moreover, linear systems and least squares problems can be solved stably) in 0{n^^^) 
arithmetic operations. These results do not require extra precision, which is why one needs to 
count arithmetic operations rather than bit operations. 

The QR decomposition can be used to stably compute a rank-revealing decomposition, the 
(generalized) Schur form, and the singular value decomposition, all in 0{n^^'^) arithmetic oper- 
ations. Computing (generalized) eigenvectors from the Schur form, can be done by solving the 
(generalized) Sylvester equation, all of which can be done stably in 0{n^'^'^) hit operations. 

Here are a few more details about the work in [DDHn7| . The paper starts off by re- 



viewing conventional block algorithms used in libraries like LAPACK lABB+QQj and ScaLA 



PACK [BCCiOTj. The normwise backward stability of these algorithms was shown earlier 



|Hig90 IDHS951 Hig02| using ^ as an assumption. This means that these algorithms are 
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guaranteed to produce the exact answer (e.g., solution of a linear system) for a matrix C close 
to the actual input matrix C, where close means close in norm: 

||C'-C||=0(e)||C||. 

Here the 0(e) is interpreted to include a factor rf' for a modest constant c. 

The running-time analysis of these block algorithms in [DDHQTj shows that these block 

9-27 

algorithms run only as fast as 0(n ^-t ) operations, where 0{ri^) is the operation count of 
matrix multiplication, with 7 used instead of a; + t/ to simplify notation. Even if 7 were to 
drop from 3 to 2, the exponent would only drop from 3 to 2.5, providing only a partial 

improvement. However, further results in |DDH07j demonstrate that one can do better. 

The next step in [DDHQTj is the application of known divide-and-conquer algorithms for 
reducing the complexity of matrix inversion to the complexity of matrix multiplication. These 
algorithms are not backward stable in the conventional sense. However, they can be shown to 
achieve the same forward error bound (bound on the norm of the error in the output) as a 
conventional backward stable algorithm, provided that they use just ©(log'' n) times as many 
bits of precision in each arithmetic operation (for some p > 0) as a conventional algorithm. Such 
algorithms are called logarithmically stable. 

Incorporating the cost of this extra precise arithmetic in the analysis only increases the total 
cost by a factor at most log^^ n. Therefore, if there are matrix multiplication algorithms running 
in 0{n^~^^) operations for any > 0, then these logarithmically stable algorithms for operations 
like matrix inversion also run in 0{n^~^'^) operations for any > 0, and satisfy the same error 
bound conventional algorithm. 

A divide-and-conquer algorithm for QR decomposition from [EGOOj is simultaneously back- 
ward stable in the conventional normwise sense (i.e., without extra precision), and runs in 
Q(ji^+'n^ operations for any r/ > 0. This algorithm may be in turn used to solve linear sys- 
tems, least-squares problems, and compute determinants equally stably and fast. The same idea 
applies to LU decomposition but stability depends on a particular pivoting assumption |DDH07| . 

The QR decomposition can then be used to compute a rank-revealing URV decomposition 
of a matrix A. This means that U and V are orthogonal, R is upper triangular, and R reveals 
the rank of A in the following sense: Suppose ai > ■ ■ ■ > an are the singular values of A. Then 
for each r, crinin(^(l : 1 : r)) is an approximation of ar and crmax(-R('^ + 1 : n, r + 1 : n)) 
is an approximation of Gr+i- The algorithm in [DDH07j is randomized, in the sense that the 
approximations of Ur and ar+i are reasonably accurate with high probability. 

Finally, the QR and URV decompositions in algorithms for the (generalized) Schur form 
of nonsymmetric matrices (or pencils) [BDG97| lower their complexity to 0(n'^+'') arithmetic 
operations while maintaining normwise backward stability. The singular-value decomposition 
may in turn be reduced to solving an eigenvalue problem with the same complexity. Computing 
(generalized) eigenvectors can only be done in a logarithmically stable way from the (general- 
ized) Schur form. This is done by providing a logarithmically stable algorithm for solving the 
(generalized) Sylvester equation, and using this to compute eigenvectors. 

This covers nearly all standard dense linear algebra operations (LU decomposition, QR 
decomposition, matrix inversion, linear equation solving, solving least squares problems, com- 
puting the (generalized) Schur form, computing the SVD, and solving (generalized) Sylvester 
equations) and shows that all those problems can be solved stably and asymptotically as fast 
as the fastest matrix multiplication algorithm that may ever exist (whether the fastest matrix 
multiplication algorithm is stable or not). For all but matrix inversion and solving (generalized) 
Sylvester equations, stability means backward stability in a normwise sense, and the complexity 
is measured by the usual count of arithmetic operations. 
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